# Replication files for "Multilevel calibration weighting for survey data"

Replication contact: Eli Ben-Michael (ebenmichael@cmu.edu)

There are 4 directories:
- `data/` contains code and files for generating and recoding the PEW and CCES samples used in the analysis
- `analysis/` contains the code to perform the analyses in the main text and the supplement
- `sims/` contains code to generate simulated data as described in the supplement, and code to create the figures in Section E of the supplement
- `results/` contains the figures and output for the paper. This is created by the scripts in `analysis/` and `sims/` if it does not already exist
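
Before running anything, it can help to confirm that you are at the root of the replication files. The check below is a minimal sketch: the function name `check_layout` and the `REPL_ROOT` variable are illustrative assumptions, not part of the replication files. It only tests for the three directories that must exist up front, since `results/` is created on demand.

```shell
#!/bin/sh
# Hypothetical sanity check, not part of the replication files.

check_layout() {
    # $1 = path to the root of the replication files (default: current directory)
    root="${1:-.}"
    missing=0
    for d in data analysis sims; do
        if [ ! -d "$root/$d" ]; then
            echo "missing directory: $d/" >&2
            missing=1
        fi
    done
    # results/ is not checked: the scripts create it if it does not exist
    return "$missing"
}

# REPL_ROOT is an assumed, optional override for the root path
check_layout "${REPL_ROOT:-.}" || echo "Run this from the root of the replication files." >&2
```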

## Data Sources

There are two data sources.

- `CCES` data: We directly import the 2016 CCES data from dataverse, and conduct recoding in the file `data/recode_cces.R`. The generated data is saved in `data/generated/cces.rds`. Citation included below:
  - Ansolabehere, Stephen; Schaffner, Brian F., 2017, "CCES Common Content, 2016", https://doi.org/10.7910/DVN/GDF6Z0, Harvard Dataverse, V4, UNF:6:WhtR8dNtMzReHC295hA4cg== [fileUNF]
  - Download data `CCES16_Common_OUTPUT_Feb2018_VV.tab` as .dta from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910/DVN/GDF6Z0

- `Pew` data: The file `data/uploaded/Oct16 public.sav` contains the raw Pew survey.  This data was downloaded from https://www.pewresearch.org/politics/2016/10/27/as-election-nears-voters-divided-over-democracy-and-respect/ on Oct. 11, 2018 using a free academic subscription. We recode the data in the file `data/recode_pre_pew.R` and the generated data is saved in `data/generated/pew.rds`.

The recoding scripts require the following packages (version used):
- tidyverse (1.3.1)
- foreign (0.8-82)
- dataverse (0.3.12)
- readstata13 (0.10.0)

For convenience, the `data/generated` directory is already populated with the recoded data, but can be regenerated with the scripts above.

## Main analysis in `analysis/`
The file `analysis/figure_doc.R` is an R script that creates all of the figures and does the majority of the work for the analysis in the main text and supplement. `analysis/helper_funcs.R` contains a few small helper functions for the analysis.

To run from the command line, navigate to the `analysis/` directory, then run
```
Rscript figure_doc.R
```
This will create all main text figures and Supplement figures A1 and A2 in the `results/` directory. 

The results were produced under R version 4.2.1. You will need the following packages (version used in the paper):
- tidyverse (1.3.1)
- glmnet (4.1-4)
- ranger (0.14.1)
- gbm (2.1.8)
- Matrix (1.4-1)
- ggrepel (0.9.1)
- multical (0.0.1): the R package associated with the paper, available at github.com/ebenmichael/multical (installing from GitHub requires the devtools package)

Expect running `analysis/figure_doc.R` to take ~15 minutes.

## Simulation results in `sims/`
The file `sims/sim_figs.R` is a script that creates the figures in Section E of the supplement. It uses pre-computed simulation results found in `sims/results/`. Run the script from the command line in the `sims/` directory:
```
Rscript sim_figs.R
```
This will create figures E1 and E2 in the `results/` directory. 

`sims/run_calibration_sim.R` is an R script to generate simulated data, fit the methods, and summarize the results. To produce the two simulation settings in the supplement (using `XX` cores), run the script from the command line in the `sims/` directory:
```
Rscript run_calibration_sim.R 'pscore="pscore_rf"' 'y_model="rvote"' n_cores=XX
Rscript run_calibration_sim.R 'pscore="pscore_4_lowreg"' 'y_model="out4_lowreg"' n_cores=XX
```
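
The nested quoting in the commands above is easy to mistype. As a minimal sketch, a small wrapper like the one below could build both command lines for you. The names `run_setting`, `N_CORES`, and `DRY_RUN`, and the default of 4 cores, are illustrative assumptions and not part of the replication files. By default it only prints the commands; set `DRY_RUN=0` to actually run them from the `sims/` directory.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of the replication files.
# Intended to be run from the sims/ directory.

N_CORES="${N_CORES:-4}"   # assumed default; set to the number of cores you can spare
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

run_setting() {
    # $1 = value for pscore, $2 = value for y_model
    if [ "$DRY_RUN" = "1" ]; then
        # Print the command with the same quoting you would type by hand
        printf "Rscript run_calibration_sim.R 'pscore=\"%s\"' 'y_model=\"%s\"' n_cores=%s\n" \
            "$1" "$2" "$N_CORES"
    else
        # The inner double quotes are passed through to the R script unchanged
        Rscript run_calibration_sim.R "pscore=\"$1\"" "y_model=\"$2\"" "n_cores=$N_CORES"
    fi
}

# The two simulation settings from the supplement
run_setting pscore_rf rvote
run_setting pscore_4_lowreg out4_lowreg
```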

This uses simulation functions from `sims/simulate.R`, and pre-fit model outputs found in `sims/sim_data.csv`, created by `sims/fit_models.R`.

The runtime of `sims/run_calibration_sim.R` depends on the number of cores you use. With 1000 replications on 10 cores, it takes ~1.5 hours.


